Datasets

DataPrep provides a collection of datasets. You can load any of them with one line of code and use it to explore DataPrep's functionality.

List Available Datasets

You can list the names of all available datasets by calling get_dataset_names, as shown below.

[1]:
from dataprep.datasets import get_dataset_names
get_dataset_names()
[1]:
['adult',
 'house_prices_test',
 'house_prices_train',
 'iris',
 'titanic',
 'waste_hauler',
 'wine-quality-red']
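
Because datasets are addressed by name, it can help to validate a name against this list before trying to load it. The following is a minimal sketch, not part of DataPrep: `check_dataset_name` and `AVAILABLE` are hypothetical names introduced here, and the list is copied from the output above so that the snippet runs without DataPrep installed.

```python
import difflib

# Names copied from the get_dataset_names() output shown above.
AVAILABLE = [
    "adult",
    "house_prices_test",
    "house_prices_train",
    "iris",
    "titanic",
    "waste_hauler",
    "wine-quality-red",
]


def check_dataset_name(name, available):
    """Return `name` if it is in `available`, else raise with close matches.

    Hypothetical helper for illustration; not a DataPrep function.
    """
    if name in available:
        return name
    # Suggest up to three similarly spelled names to catch typos.
    suggestions = difflib.get_close_matches(name, available, n=3, cutoff=0.6)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unknown dataset {name!r}.{hint}")


print(check_dataset_name("titanic", AVAILABLE))
```

In a session with DataPrep installed, the result of get_dataset_names() could be passed as `available` instead of the hard-coded list.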

Load Dataset

Once you know the available dataset names from get_dataset_names, you can load a dataset by calling load_dataset.

[2]:
from dataprep.datasets import load_dataset
df = load_dataset("titanic")
df
[2]:
     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0    1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1    2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2    3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3    4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4    5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
...  ...          ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...    ...
886  887          0         2       Montvila, Rev. Juozas                              male    27.0  0      0      211536            13.0000  NaN    S
887  888          1         1       Graham, Miss. Margaret Edith                       female  19.0  0      0      112053            30.0000  B42    S
888  889          0         3       Johnston, Miss. Catherine Helen "Carrie"           female  NaN   1      2      W./C. 6607        23.4500  NaN    S
889  890          1         1       Behr, Mr. Karl Howell                              male    26.0  0      0      111369            30.0000  C148   C
890  891          0         3       Dooley, Mr. Patrick                                male    32.0  0      0      370376            7.7500   NaN    Q

891 rows × 12 columns

Analyze Dataset

After loading the dataset, you can explore it with DataPrep. For example, you can create a profiling report of the dataset using dataprep.eda.

[3]:
from dataprep.eda import create_report
report = create_report(df)
report
[3]:
DataPrep Report

Overview

Dataset Statistics

Number of Variables 12
Number of Rows 891
Missing Cells 0
Missing Cells (%) 0.0%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 646.0 KB
Average Row Size in Memory 742.4 B

Variable Types

Categorical 12

Variables

PassengerId

numerical

Distinct Count 891
Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 13.9 KB
Mean 446
Minimum 1
Maximum 891
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%

Quantile Statistics

Minimum 1
5-th Percentile 45.5
Q1 223.5
Median 446
Q3 668.5
95-th Percentile 846.5
Maximum 891
Range 890
IQR 445

Descriptive Statistics

Mean 446
Standard Deviation 257.3538
Variance 66231
Sum 397386
Skewness 0
Kurtosis -1.2
Coefficient of Variation 0.577
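
To make the report's figures easier to interpret, the PassengerId statistics above can be reproduced with Python's standard library. This is a sketch under one assumption: that PassengerId is exactly the integers 1 to 891, which matches the minimum, maximum and distinct count shown. The numbers agree with a sample variance (n - 1 denominator) and linearly interpolated quantiles, but that is inferred from the output, not taken from DataPrep's implementation.

```python
import math
import statistics

# Assumption: PassengerId holds the integers 1..891 with no gaps.
ids = list(range(1, 892))

total = sum(ids)
mean = statistics.mean(ids)
variance = statistics.variance(ids)  # sample variance, n - 1 denominator
std = math.sqrt(variance)
cv = std / mean                      # coefficient of variation

# "inclusive" interpolates linearly between the minimum and maximum.
q1, median, q3 = statistics.quantiles(ids, n=4, method="inclusive")
iqr = q3 - q1

print("Sum", total)
print("Mean", mean)
print("Variance", variance)
print("Standard Deviation", round(std, 4))
print("Coefficient of Variation", round(cv, 3))
print("Q1", q1, "Median", median, "Q3", q3, "IQR", iqr)
```

The same relationships (IQR = Q3 - Q1, coefficient of variation = standard deviation / mean) hold for the other numerical variables in the report.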

Survived

numerical

Distinct Count 2
Unique (%) 0.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 13.9 KB
Mean 0.3838
Minimum 0
Maximum 1
Zeros 549
Zeros (%) 61.6%
Negatives 0
Negatives (%) 0.0%

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 1
95-th Percentile 1
Maximum 1
Range 1
IQR 1

Descriptive Statistics

Mean 0.3838
Standard Deviation 0.4866
Variance 0.2368
Sum 342
Skewness 0.4777
Kurtosis -1.7718
Coefficient of Variation 1.2677

Pclass

numerical

Distinct Count 3
Unique (%) 0.3%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 13.9 KB
Mean 2.3086
Minimum 1
Maximum 3
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%

Quantile Statistics

Minimum 1
5-th Percentile 1
Q1 2
Median 3
Q3 3
95-th Percentile 3
Maximum 3
Range 2
IQR 1

Descriptive Statistics

Mean 2.3086
Standard Deviation 0.8361
Variance 0.699
Sum 2057
Skewness -0.6295
Kurtosis -1.2796
Coefficient of Variation 0.3621

Name

categorical

Distinct Count 891
Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Memory Size 80.0 KB

Length

Mean 26.9652
Standard Deviation 9.2816
Median 25
Minimum 12
Maximum 82

Sample

1st row Braund, Mr. Owen H...
2nd row Cumings, Mrs. John...
3rd row Heikkinen, Miss. L...
4th row Futrelle, Mrs. Jac...
5th row Allen, Mr. William...

Letter

Count 19091
Lowercase Letter 15446
Space Separator 2735
Uppercase Letter 3645
Dash Punctuation 13
Decimal Number 0

Sex

categorical

Distinct Count 2
Unique (%) 0.2%
Missing 0
Missing (%) 0.0%
Memory Size 60.7 KB

Length

Mean 4.7048
Standard Deviation 0.956
Median 4
Minimum 4
Maximum 6

Sample

1st row male
2nd row female
3rd row female
4th row female
5th row male

Letter

Count 4192
Lowercase Letter 4192
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0

Age

numerical

Distinct Count 88
Unique (%) 12.3%
Missing 177
Missing (%) 19.9%
Infinite 0
Infinite (%) 0.0%
Memory Size 11.2 KB
Mean 29.6991
Minimum 0.42
Maximum 80
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%

Quantile Statistics

Minimum 0.42
5-th Percentile 4
Q1 20.125
Median 28
Q3 38
95-th Percentile 56
Maximum 80
Range 79.58
IQR 17.875

Descriptive Statistics

Mean 29.6991
Standard Deviation 14.5265
Variance 211.0191
Sum 21205.17
Skewness 0.3883
Kurtosis 0.1686
Coefficient of Variation 0.4891
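
Age is the one numerical variable here with missing values, so it shows which denominators the percentages use. The sketch below only does arithmetic on numbers copied from the report. It suggests that Missing (%) is relative to all rows, while Unique (%) and Mean are relative to the non-missing values. This reading is inferred from the figures and is an assumption, not a statement about how DataPrep computes them.

```python
# Figures copied from the report above.
rows = 891        # Number of Rows (overview)
missing = 177     # Age: Missing
distinct = 88     # Age: Distinct Count
total = 21205.17  # Age: Sum

present = rows - missing                 # non-missing Age values
missing_pct = 100 * missing / rows       # share of all rows
unique_pct = 100 * distinct / present    # share of non-missing values
mean = total / present                   # mean over non-missing values

print("Non-missing", present)
print("Missing (%)", round(missing_pct, 1))
print("Unique (%)", round(unique_pct, 1))
print("Mean", round(mean, 4))
```

Dividing the distinct count by all 891 rows instead would give about 9.9%, which does not match the 12.3% shown in the report.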

SibSp

numerical

Distinct Count 7
Unique (%) 0.8%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 13.9 KB
Mean 0.523
Minimum 0
Maximum 8
Zeros 608
Zeros (%) 68.2%
Negatives 0
Negatives (%) 0.0%

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 1
95-th Percentile 3
Maximum 8
Range 8
IQR 1

Descriptive Statistics

Mean 0.523
Standard Deviation 1.1027
Variance 1.216
Sum 466
Skewness 3.6891
Kurtosis 17.7735
Coefficient of Variation 2.1085

Parch

numerical

Distinct Count 7
Unique (%) 0.8%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 13.9 KB
Mean 0.3816
Minimum 0
Maximum 6
Zeros 678
Zeros (%) 76.1%
Negatives 0
Negatives (%) 0.0%

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 0
95-th Percentile 2
Maximum 6
Range 6
IQR 0

Descriptive Statistics

Mean 0.3816
Standard Deviation 0.8061
Variance 0.6497
Sum 340
Skewness 2.7445
Kurtosis 9.7166
Coefficient of Variation 2.1123

Ticket

categorical

Distinct Count 681
Unique (%) 76.4%
Missing 0
Missing (%) 0.0%
Memory Size 62.4 KB

Length

Mean 6.7508
Standard Deviation 2.7455
Median 6
Minimum 3
Maximum 18

Sample

1st row A/5 21171
2nd row PC 17599
3rd row STON/O2. 3101282
4th row 113803
5th row 373450

Letter

Count 673
Lowercase Letter 21
Space Separator 239
Uppercase Letter 652
Dash Punctuation 0
Decimal Number 4808

Fare

numerical

Distinct Count 248
Unique (%) 27.8%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 13.9 KB
Mean 32.2042
Minimum 0
Maximum 512.3292
Zeros 15
Zeros (%) 1.7%
Negatives 0
Negatives (%) 0.0%

Quantile Statistics

Minimum 0
5-th Percentile 7.225
Q1 7.9104
Median 14.4542
Q3 31
95-th Percentile 112.0791
Maximum 512.3292
Range 512.3292
IQR 23.0896

Descriptive Statistics

Mean 32.2042
Standard Deviation 49.6934
Variance 2469.4368
Sum 28693.9493
Skewness 4.7793
Kurtosis 33.2043
Coefficient of Variation 1.5431

Cabin

categorical

Distinct Count 148
Unique (%) 16.6%
Missing 0
Missing (%) 0.0%
Memory Size 59.3 KB

Length

Mean 3.1347
Standard Deviation 1.021
Median 3
Minimum 1
Maximum 15

Sample

1st row nan
2nd row C85
3rd row nan
4th row C123
5th row nan

Letter

Count 2299
Lowercase Letter 2061
Space Separator 34
Uppercase Letter 238
Dash Punctuation 0
Decimal Number 460

Embarked

categorical

Distinct Count 4
Unique (%) 0.4%
Missing 0
Missing (%) 0.0%
Memory Size 57.4 KB

Length

Mean 1.0045
Standard Deviation 0.0947
Median 1
Minimum 1
Maximum 3

Sample

1st row S
2nd row C
3rd row S
4th row S
5th row S

Letter

Count 895
Lowercase Letter 6
Space Separator 0
Uppercase Letter 889
Dash Punctuation 0
Decimal Number 0

Interactions

Correlations

Missing Values